Abstract

Most work related to unithood has been conducted as part of a larger effort to determine termhood. Consequently, the number of independent studies that examine the notion of unithood and produce dedicated techniques for measuring it is extremely small. We propose a new approach, independent of any influence of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from the Google search engine for the measurement of unithood. Our evaluations revealed a precision and recall of 98.68% and 91.82% respectively, with an accuracy of 95.42%, in measuring the unithood of 1005 test cases.

1 Introduction 

Terms and the tasks related to their treatments are 
an integral part of many applications that deal with 
natural language text such as large-scale search 
engines, automatic thesaurus construction, ma- 
chine translation and ontology learning for pur- 
poses ranging from indexing to cluster analysis. 
With the increasing reliance on huge text sources 
such as the World Wide Web as input, the need to 
provide automated means for managing domain- 
specific terms rises. Such relevance and impor- 
tance of terms has prompted dedicated research 
interests. Various names such as automatic term 
recognition, term extraction and terminology min- 
ing were given to encompass the tasks related to 
the treatment of terms. Term extraction is the pro- 
cess of extracting lexical units from text and fil- 
tering them for the purpose of identifying terms 
which characterise certain domains of interest. 
This process involves determining two important factors, namely, unithood and termhood. Unithood concerns whether or not a sequence of words should be combined to form a more stable lexical unit, and termhood measures the degree to which these stable lexical units are related to domain-specific concepts. While unithood is only relevant to complex terms (i.e. multi-word terms), termhood concerns both simple terms (i.e. single-word terms) and complex terms.

Most research in automatic term recognition has been conducted solely to study and develop techniques for measuring termhood, while only a small number of studies address unithood. Unfortunately, rather than considering the measurement of unithood as an important prerequisite, these researchers merely treat it as part of a larger scoring and filtering mechanism for determining termhood. Such a perception is clearly reflected in the words of Kit (2002): "...we can see that the unithood is actually subsumed, in general, by the termhood." Consequently, the significance of unithood measures has been overshadowed by the larger notion of termhood. As such, progress and innovation in this small sub-field of automatic term recognition have been minimal. Most of the existing techniques for measuring unithood employ conventional measures such as mutual information and log-likelihood, and rely simply on the occurrence and co-occurrence frequencies from domain corpora as the source of evidence.

In this paper, we propose the separation of unithood measurements from the determination of termhood. From here on, we will consider unithood measurement as an important prerequisite, rather than a subsumption, to the determination of termhood. We present a new dedicated approach for determining the unithood of word sequences by employing the Google search engine as the source of statistical evidence, and measures inspired by mutual information (Church and Hanks (1990)) and Cvalue (Frantzi (1997)). The use of the World Wide Web to replace the conventional use of static corpora will eliminate issues related to portability to other domains, and the size of text necessary to induce the required statistical evidence. In addition, this dedicated approach to determining the unithood of word sequences will prove to be invaluable to other areas in natural language processing such as noun-phrase chunking and named-entity recognition. Our evaluations revealed a precision and a recall of 98.68% and 91.82% respectively, with an accuracy of 95.42%, in measuring the unithood of 1005 test cases.

In Section 2, we briefly review the existing techniques for measuring unithood. In Section 3, we present our new approach, the measures involved and the justification behind every aspect of the measures. In Section 4, we summarize some findings from our evaluations. We discuss in Section 5 why our new approach is applicable to other tasks in natural language processing such as named-entity recognition. Finally, we conclude this paper with an outlook on future work in Section 6.

2 Related Works 



Prior to measuring unithood, term candidates must be extracted. There are two common approaches for extracting the term candidates. The first requires the corpus to be tagged or parsed, and a filter is then employed to extract words or phrases satisfying some linguistic patterns. There are two types of filters for extracting from a tagged corpus, namely, open or closed. Filters that are too restrictive (i.e. closed) and rely on a small set of allowable parts of speech will produce high precision but poor recall (Frantzi and Ananiadou (1997)). On the other hand, filters that are too liberal (i.e. open), allowing parts of speech such as prepositions and adjectives, will have the opposite effect. Most of the existing approaches rely on regular expressions and part-of-speech tags to accept or reject sequences of n-grams as term candidates. For example, Frantzi and Ananiadou (1997) employ the Brill tagger to tag the raw corpus with parts of speech and later extract n-grams that fulfil the pattern (Noun|Adjective)+Noun. Bourigault and Jacquemin (1999) utilise SYLEX, a part-of-speech tagger, to tag the raw corpus. The part-of-speech tags are utilised to extract maximal-length noun phrases, which are later recursively decomposed into heads and modifiers. At the other extreme, Dagan and Church (1994) accept only sequences of Noun+. The second type of extraction approach works on the raw corpus using a set of heuristics. This type of approach, which does not rely on part-of-speech tags, is quite rare. Such approaches have to make use of textual surface constraints to approximate the boundaries of term candidates. One such constraint is the use of a stopword list to obtain the boundaries of stopwords for inferring the boundaries of candidates. A selection list of allowable prepositions can also be employed to enforce constraints on the tokens between units.
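
As an illustration of a closed filter, the following minimal sketch matches the (Noun|Adjective)+Noun pattern over a part-of-speech-tagged sentence. The tag-to-letter mapping, the function name `extract_candidates` and the sample sentence are assumptions made for this example; they are not taken from any of the cited systems.

```python
import re

# Hypothetical coarse mapping from Penn Treebank tags to one letter per token.
TAG_CLASS = {"NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N", "JJ": "A"}

# Closed filter: one or more nouns/adjectives followed by a noun.
PATTERN = re.compile(r"[NA]+N")


def extract_candidates(tagged):
    """Return word sequences whose tags match (Noun|Adjective)+Noun."""
    # Encode the sentence with exactly one character per token so that
    # regex match positions map directly back to token indices.
    encoded = "".join(TAG_CLASS.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in PATTERN.finditer(encoded)
    ]


# Hypothetical tagged sentence.
sentence = [
    ("acute", "JJ"), ("food", "NN"), ("poisoning", "NN"),
    ("is", "VBZ"), ("caused", "VBN"), ("by", "IN"), ("bacteria", "NNS"),
]
print(extract_candidates(sentence))
```

Because `finditer` returns only maximal, non-overlapping matches, this sketch yields the longest sequences; nested sub-sequences would have to be enumerated separately.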

The filters for extracting term candidates make use of only local, surface-level information, namely, the part-of-speech tags. More evidence is required to establish the dependence between the constituents of each term candidate to ensure strong unithood. Such evidence will usually be statistical in nature, in the form of co-occurrences of the constituents in the corpus. Accordingly, the unithood of the term candidates can be determined either as a separate step or as part of the extraction process. From our review of the literature, only an extremely small number of researchers have actually discussed and presented measures for unithood. The lack of extensive research and techniques related specifically to unithood is reaffirmed by Kit (2002). According to the author, "...its measure (if there is one) indicates how likely it is that a term candidate is an atomic text unit."

Two of the most common measures of unithood are pointwise mutual information (MI) (Church and Hanks (1990)) and log-likelihood ratio (Dunning (1994)). In mutual information, the co-occurrence frequencies of the constituents of complex terms are utilised to measure their dependency. The mutual information for two words a and b is defined as:



$$MI(a,b) = \log_2 \frac{p(a,b)}{p(a)\,p(b)} \qquad (1)$$



where p(a) and p(b) are the probabilities of occurrence of a and b, and p(a,b) is the probability of their co-occurrence. Many measures that apply statistical techniques assuming a strict normal distribution and independence between word occurrences do not fare well. For handling extremely uncommon words or small corpora, the log-likelihood ratio delivers the best precision (Kurz and Xu (2002); Franz (1997)). The log-likelihood ratio attempts to quantify how much more likely one pair of words is to occur compared to the others. Despite its potential, "How to apply this statistic measure to quantify structural dependency of a word sequence remains an interesting issue to explore." (Kit (2002)).
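
To make Equation 1 concrete, the sketch below estimates the probabilities from raw corpus counts by maximum likelihood. The function name `pointwise_mi`, its parameters and the counts are hypothetical and serve only as an illustration.

```python
import math


def pointwise_mi(count_ab, count_a, count_b, total):
    """Pointwise mutual information of Equation 1 from raw counts.

    count_ab: co-occurrences of a and b; count_a, count_b: occurrences of
    each word; total: number of observations used for normalisation.
    """
    if min(count_ab, count_a, count_b, total) <= 0:
        raise ValueError("all counts must be positive")
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))


# Hypothetical counts: the pair co-occurs 100 times more often than chance.
print(pointwise_mi(count_ab=20, count_a=40, count_b=50, total=10_000))
```

A value near 0 indicates that the two words co-occur about as often as independence would predict, while large positive values indicate strong association.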



Frantzi (1997) proposed a measure known as Cvalue for extracting complex terms. The measure is based upon the claim that a substring of a term candidate is a candidate itself given that it demonstrates adequate independence from the longer version it appears in. For example, "E. coli food poisoning", "E. coli" and "food poisoning" are acceptable as valid complex term candidates. However, "E. coli food" is not. Therefore, some measures are required to gauge the strength of word combinations to decide whether two word sequences should be merged or not. Given a word sequence a to be examined for unithood, the Cvalue is defined as:



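$$Cvalue(a) = \begin{cases} \log_2|a| \cdot f_a & \text{if } |a| = g \\ \log_2|a| \cdot \left(f_a - \dfrac{\sum_{l \in L_a} f_l}{|L_a|}\right) & \text{otherwise} \end{cases} \qquad (2)$$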

where |a| is the number of words in a, L_a is the set of longer term candidates that contain a, g is the longest n-gram considered, f_a is the frequency of occurrence of a, and a ∉ L_a. While certain researchers (Kit (2002)) consider Cvalue as a termhood measure, others (Nakagawa and Mori (2002)) accept it as a measure for unithood. One can observe that longer candidates tend to gain higher weights due to the inclusion of log2|a| in Equation 2. In addition, the weights computed using Equation 2 are purely dependent on the frequency of a.
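
A minimal sketch of Equation 2 follows, assuming the standard form of the Cvalue in which the mean frequency of the longer candidates is subtracted from the frequency of a. The function name `cvalue`, the data layout and the frequencies are hypothetical. Treating a candidate with an empty L_a like the |a| = g case is also an assumption, made here to avoid a division by zero.

```python
import math


def cvalue(candidate, freq, longer, g):
    """Cvalue of a candidate (a tuple of words).

    freq: dict mapping candidates to frequencies (f_a);
    longer: list of longer candidates containing `candidate` (L_a);
    g: length of the longest n-gram considered.
    """
    length_weight = math.log2(len(candidate))
    f_a = freq[candidate]
    # Assumption: a candidate with no longer containers is scored like |a| = g.
    if len(candidate) == g or not longer:
        return length_weight * f_a
    nested_mean = sum(freq[l] for l in longer) / len(longer)
    return length_weight * (f_a - nested_mean)


# Hypothetical frequencies.
short = ("food", "poisoning")
long_ = ("E.", "coli", "food", "poisoning")
freq = {short: 30, long_: 10}
print(cvalue(short, freq, [long_], g=4))
print(cvalue(long_, freq, [], g=4))
```

Note that a single-word candidate always receives a weight of 0 because log2(1) = 0, which is consistent with the measure being intended for complex terms.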

3 A New Approach for Unithood 
Measurement 

Our new approach for measuring the unithood of 
word sequences consists of two parts. Firstly, a 
list of word sequences is extracted using purely 
linguistic techniques. Secondly, word sequences 
are examined and the related statistical evidence is 
gathered to assist in determining their mutual in- 
formation and independence. 

3.1 Extracting Word Sequences 

Existing techniques for extracting word sequences have relied on part-of-speech information and filters in the form of pattern matching (e.g. regular expressions). Since the head-modifier principle is important for our techniques, we employ both the part-of-speech information and dependency relations for extracting term candidates. The filter is implemented as a head-driven left-right filter (Wong (2005)) that feeds on the output of the Stanford Parser (Klein and Manning (2003)), which is an implementation of an unlexicalised probabilistic context-free grammar (PCFG) and lexical dependency parser. The head-driven filter begins by identifying a list of head nouns from the output of the Stanford Parser. As the name suggests, the filter begins from the head and proceeds to the left and later to the right in an attempt to identify maximal-length noun phrases according to the head-modifier information. During the process, the filter will append or prepend any immediate modifier of the current head which is a noun (except possessive nouns), an adjective or a foreign word. Each noun phrase or segment of a noun phrase identified using the head-driven filter is known as a potential term candidate, a_i ∈ A, where i is the word offset produced by the Stanford Parser (i.e. the "offset" column in Figure 1).

Figure 1 shows the output of the Stanford Parser for the sentence "They're living longer with HIV in the brain, explains Kathy Kopnisky of the NIH's National Institute of Mental Health, which is spending about millions investigating neuroAIDS.". Note that the words are lemmatised to obtain the root form. The head nouns are marked with squares in the figure. For example, the head "Institute" is modified by "NIH's", "National" and "of". Since we do not allow modifiers of the type "possessive" and "preposition", we will obtain "National Institute" as shown in Figure 2. Figure 2 shows the head-driven filter at work for some of the head nouns identified from the "modifiee" column of the output in Figure 1. After the head-driven filter has identified potential term candidates using the heads, remaining nouns from the "word" column in Figure 1 which are not part of any potential term candidates will be included in A.
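
The left-first, right-later expansion described above can be sketched as follows. This is a simplified illustration rather than the filter of Wong (2005): it only considers modifiers that are immediately adjacent to the growing phrase. The `Token` structure, the Penn Treebank tag set in `ALLOWED_TAGS` and the head offsets in the example are all assumptions.

```python
from dataclasses import dataclass

# Assumed Penn Treebank tags for nouns, adjectives and foreign words.
ALLOWED_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "FW"}


@dataclass
class Token:
    offset: int  # word offset in the sentence
    word: str
    tag: str     # part-of-speech tag
    head: int    # offset of the token this word modifies


def expand_head(tokens, head_offset):
    """Grow a noun phrase around a head noun: left first, then right."""
    by_offset = {t.offset: t for t in tokens}
    phrase = [by_offset[head_offset]]

    def qualifies(tok):
        # A token is absorbed only if it modifies this head and has an
        # allowed part of speech (possessives and prepositions do not).
        return (tok is not None and tok.head == head_offset
                and tok.tag in ALLOWED_TAGS)

    left = head_offset - 1
    while qualifies(by_offset.get(left)):
        phrase.insert(0, by_offset[left])
        left -= 1
    right = head_offset + 1
    while qualifies(by_offset.get(right)):
        phrase.append(by_offset[right])
        right += 1
    return " ".join(t.word for t in phrase)


# Hypothetical parse fragment for "... NIH's National Institute of Mental Health".
tokens = [
    Token(18, "NIH", "NNP", 21), Token(19, "'s", "POS", 18),
    Token(20, "National", "NNP", 21), Token(21, "Institute", "NNP", 16),
    Token(22, "of", "IN", 21), Token(23, "Mental", "NNP", 24),
    Token(24, "Health", "NNP", 22),
]
print(expand_head(tokens, 21))
print(expand_head(tokens, 24))
```

Under these assumed dependencies, the two heads yield "National Institute" and "Mental Health", which are then candidates for merging across the preposition "of".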

3.2 Determining the Unithood of Word
Sequences

In the following step, we examine the unithood of all pairs of potential term candidates (a_x, a_y) ∈ A with a_x and a_y located immediately next to each other (i.e. x + 1 = y), or separated by a preposition or the coordinating conjunction "and" (i.e. x + 2 = y). Obviously, a_x has to appear before a_y in the sentence or, in other words, x < y for all pairs, where x and y are the word offsets produced by the Stanford Parser. Formally, given that s = a_x b a_y where b is any preposition, the conjunction "and" or an empty string, the problem is to determine whether to accept s as an independent lexical unit (i.e. a term candidate) or leave a_x and a_y as separate units. In order to decide on the merge, we need adequate evidence that s will form a stable unit and hence, a better term candidate than a_x and a_y separated. It is worth mentioning that the size (i.e. number of words) of a_x and a_y is not limited to 1. For example, we can have a_x = "National Institutes", b = "of" and a_y = "Allergy and Infectious Diseases". In addition, the size of a_x and a_y should have no effect on the determination of their unithood.

Figure 1: The output from the Stanford Parser. The tokens in the "modifiee" column marked with squares are head nouns, and the corresponding tokens along the same rows in the "word" column are the modifiers. The first column "offset" is subsequently represented using the variable i.

The most suitable existing measure for gathering evidence about the dependency between two words is mutual information. Based on the conventional practice, the frequency of occurrence of each element in W = {s, a_x, a_y} is normalized using the sum of the frequency of all occurrences in W. Since data sparseness in the local corpus may lead to poor estimation of mutual information, we instead employ the page count returned by the Google search engine for calculating the dependency of the elements in W. We treat the World Wide Web as a large general corpus and the Google search engine as a gateway for accessing the documents in the corpus. Our choice of using Google to obtain the page count was merely motivated by its extensive coverage. In fact, it is possible to employ any search engine on the World Wide Web for this research. Each element in W is formulated as a search query and submitted to the Google search engine. The page count returned is utilised for calculating the mutual information. In addition, we also apply a multiplier within the range of [e^{-1}, 1], inspired by TF-IDF, to offset too common terms, especially in three-word terms such as "Institute of Science". Formally, for each w ∈ W, we define the weight as:

$$p(w) = \frac{n_w}{\sum_{v \in W} n_v}\, e^{-\frac{n_w}{\sum_{v \in W} n_v}} \qquad (3)$$

where n_w is the page count (i.e. number of documents) returned by the Google search engine for w ∈ W. We only take into consideration the number of documents that contain the word sequences (i.e. page count) due to the difficulty in obtaining the actual frequency of occurrences of word sequences from Google's search results. Next, the mutual information between the two units a_x and a_y is defined as:






Figure 2: An example of our head-driven left-right filter at work. The tokens which are highlighted with a darker tone are the head nouns. The underlined tokens are the modifiers identified as a result of the left-first and right-later movement of the filter. In the first segment, the head noun is "Health" while the first token to the left, "Mental", is the corresponding modifier. In the second segment of the figure, "Institute" is the head noun and "National" is the modifier. Note that the results from the first and second segments are actually part of a longer noun phrase. Due to the restriction in accepting prepositions by the head-driven filter, "of" is omitted from the output of the first segment of the figure.
If the occurrence of s approaches 0 (i.e. a_x and a_y are rarely seen together as a unit with b), then MI(a_x, a_y) reduces to 0. If the occurrence of s approaches that of the co-occurrence of a_x and a_y, MI(a_x, a_y) will approach 1. A high p(s) indicates stronger coupling of a_x and a_y, or even more so, that the existence of a_x and a_y is purely due to the existence of s. A high MI(a_x, a_y) implies an increase in the unithood of the two units a_x and a_y. Following this, a_x and a_y will have poor unithood in their individual forms if they are not merged into s to form a stronger unit. This mutual information measure is necessary in distinguishing phrases such as "Asia and Europe" (i.e. low mutual information) from "U.S. Food and Drug Administration" (i.e. high mutual information) when prepositions and conjunctions are involved.
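
The page-count weight of Equation 3 can be sketched as below. The form used here, a normalised page count multiplied by an exponential damping factor in [e^{-1}, 1], is an assumption read off the textual description of the weight. The function name `page_count_weights` and the page counts are hypothetical, and no search engine is queried.

```python
import math


def page_count_weights(page_counts):
    """Map each query in W to its weight p(w).

    page_counts: dict mapping each element of W = {s, a_x, a_y}
    to its page count n_w.
    """
    total = sum(page_counts.values())
    if total <= 0:
        raise ValueError("at least one page count must be positive")
    weights = {}
    for w, n_w in page_counts.items():
        share = n_w / total                      # normalised page count
        weights[w] = share * math.exp(-share)    # damping factor in [1/e, 1]
    return weights


# Hypothetical page counts.
counts = {
    "Institute of Ophthalmology": 1_000,
    "Institute": 600_000,
    "Ophthalmology": 399_000,
}
for query, weight in page_count_weights(counts).items():
    print(f"{query}: {weight:.6f}")
```

The damping factor penalises elements that dominate W, which is how the multiplier offsets very common units.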

Nonetheless, the units a_x and a_y may still be capable of forming valid compound units even though their mutual information is relatively low. Low mutual information can be attributed to the high individual occurrences of a_x and a_y due to their extremely common usage. For example, a_x = "Institute" and a_y = "Ophthalmology" yield high occurrences relative to s = "Institute of Ophthalmology". This does not mean that "Institute of Ophthalmology" is not a valid unit. To handle such cases where MI(a_x, a_y) is mediocre due to the commonness of a_x and a_y, we employ another measure of independence. In such a situation, we will still accept s as a valid unit if it can be demonstrated that the extremely high independence of the individual units a_x and a_y is the cause behind the low MI(a_x, a_y). For this purpose, we modify the Cvalue described in Equation 2 to accommodate the use of page counts rather than frequency. In addition, we remove the multiplier log2|a| because the number of words in a_x and a_y does not play a role in determining their independence from s. Consequently, we define the measure of Independence (ID) for a_x and a_y from s as:

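$$ID(a_x, s) = \begin{cases} \log_{10}(n_{a_x} - n_s) & \text{if } n_{a_x} > n_s \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

$$ID(a_y, s) = \begin{cases} \log_{10}(n_{a_y} - n_s) & \text{if } n_{a_y} > n_s \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$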

where n_ax, n_ay and n_s are the Google page counts for the units a_x, a_y and s, respectively. As the lexical unit a_x occurs more than its longer counterpart s, its independence ID(a_x, s) grows. Only when the number of occurrences of a_x does not exceed that of s does its independence from s become ID(a_x, s) = 0. This means that we will not be able to witness a_x without encountering s. The same can be said about the measure of independence for a_y, ID(a_y, s). In short, extremely high independence of a_x and a_y relative to s will be reflected through high ID(a_x, s) and ID(a_y, s).
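
The independence measure of Equations 5 and 6 reduces to a few lines of code. In the sketch below, the function name `independence` and the page counts are hypothetical.

```python
import math


def independence(n_unit, n_s):
    """ID of a unit from s, given the page counts of the unit and of s."""
    if n_unit > n_s:
        # Pages on which the unit appears beyond those accounted for by s.
        return math.log10(n_unit - n_s)
    return 0.0


# Hypothetical page counts: the unit is seen far more often than s.
print(independence(n_unit=2_000_000, n_s=1_000_000))
# The unit is never seen more often than s.
print(independence(n_unit=900, n_s=1_000))
```

Because the measure is logarithmic, a threshold on ID corresponds to an order of magnitude of surplus page counts.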



Consequently, the decision to merge a_x and a_y to form s depends on both the mutual information between a_x and a_y, namely, MI(a_x, a_y), and the independence of a_x and a_y from s, namely, ID(a_x, s) and ID(a_y, s). This decision is organised into a Boolean function known as Unithood (UH), and we define it as:

$$UH(a_x, a_y) = \begin{cases} 1 & \text{if } \big(MI(a_x,a_y) > MI^{+}\big) \,\lor\, \big(MI^{+} \geq MI(a_x,a_y) \geq MI^{-} \,\land\, ID(a_x,s) \geq ID_T \,\land\, ID(a_y,s) \geq ID_T \,\land\, IDR^{+} \geq IDR(a_x,a_y) \geq IDR^{-}\big) \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where IDR(a_x, a_y) = ID(a_x, s)/ID(a_y, s). IDR helps to ensure that pairs with mediocre mutual information not only have a_x and a_y with high independence but that the two are also equally independent before merging is performed. The unithood function in Equation 7 summarises the relationship between mutual information and the independence measure. UH(a_x, a_y) simply states that the two lexical units a_x and a_y can only be merged in two cases:

• If a_x and a_y have extremely high mutual information (i.e. higher than a certain threshold MI+); or

• If a_x and a_y achieve average mutual information (i.e. within the acceptable range of the two thresholds MI+ and MI−) due to both of their extremely high independence (i.e. higher than the threshold ID_T) from s. To ensure that both units have equally high independence, their ratio of independence IDR has to fall within the range IDR− and IDR+.

The thresholds for MI(a_x, a_y), ID(a_x, s), ID(a_y, s) and IDR(a_x, a_y) are decided empirically through our evaluations:

• MI+ = 0.9

• MI− = 0.02

• ID_T = 6

• IDR+ = 1.35

• IDR− = 0.93



A general guideline for determining the appropriate threshold values is provided in the next section. Finally, the word sequence s = a_x b a_y will be accepted as a stable lexical unit (i.e. term candidate) if and only if UH(a_x, a_y) = 1.
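
The decision function UH can be sketched as a short Boolean routine. The sketch assumes that the range checks are inclusive, that ID values must be at least ID_T, and that the upper ratio threshold IDR+ is 1.35. The names `Thresholds` and `unithood` and the sample inputs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    mi_high: float = 0.9    # MI+
    mi_low: float = 0.02    # MI-
    id_min: float = 6.0     # ID_T
    idr_high: float = 1.35  # IDR+ (assumed reading of the threshold list)
    idr_low: float = 0.93   # IDR-


def unithood(mi, id_x, id_y, t=Thresholds()):
    """Return 1 if a_x and a_y should be merged into s, otherwise 0."""
    # Case 1: extremely high mutual information is sufficient on its own.
    if mi > t.mi_high:
        return 1
    # Case 2: mediocre mutual information needs support from independence.
    if not (t.mi_low <= mi <= t.mi_high):
        return 0
    if id_x < t.id_min or id_y < t.id_min or id_y == 0:
        return 0
    idr = id_x / id_y  # both units must be about equally independent
    return 1 if t.idr_low <= idr <= t.idr_high else 0


# Hypothetical measurements.
print(unithood(mi=0.95, id_x=0.0, id_y=0.0))  # high MI
print(unithood(mi=0.50, id_x=6.5, id_y=6.4))  # mediocre MI, high and balanced ID
print(unithood(mi=0.50, id_x=5.0, id_y=6.4))  # mediocre MI, one low ID
```

Keeping the thresholds in a single immutable object makes it straightforward to explore the trade-off between precision and recall by varying them.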

4 Evaluations and Discussions 

Table 1: Contingency table constructed using the 
actual and ideal results for computing precision, 
recall, accuracy and F-score. Ideal results are 
used as a reference for evaluation. Actual results 
are the actual output from our new approach. 
For this evaluation, we employ 300 news articles from Reuters in the health domain gathered between October 2006 and January 2007. These 300 articles are fed into the Stanford Parser, whose output is then used by our head-driven left-right filter to extract word sequences in the form of nouns and noun phrases. Pairs of word sequences (i.e. a_x and a_y) located immediately next to each other, or separated by a preposition or the conjunction "and" in the same sentence, are measured for their unithood. Based on the UH(a_x, a_y) of the pairs, the decisions on whether to merge or not are made automatically. These decisions are known as the actual results. At the same time, we inspect the same list manually to decide on the merging of all the pairs. These decisions are known as the ideal results. Using the 300 news articles, we obtained 1005 pairs of word sequences to be tested for unithood. The actual and ideal results are organised into a contingency table as shown in Table 1 to identify the true and the false positives, and the true and the false negatives. Using the results in Table 1, we obtained a precision of 98.68%, a recall of 91.82% and an F-score of 90.61%. As for accuracy, our new measures for unithood scored 95.42%. This shows that our new measures have very good precision and a relatively lower recall due to the number of false negatives.
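
For reference, the scores can be derived from the four cells of a contingency table as sketched below. The counts are hypothetical and are not those of Table 1, and the balanced F1 definition of the F-score is an assumption, since the exact definition is not spelled out in the text.

```python
def evaluation_scores(tp, fp, fn, tn):
    """Precision, recall, accuracy and balanced F-score from a 2x2 table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Hypothetical counts: merges that are correct (tp) or wrong (fp),
# and non-merges that are wrong (fn) or correct (tn).
scores = evaluation_scores(tp=90, fp=10, fn=30, tn=70)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```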

Firstly, we realised that the high false negative rate is explained by our more conservative definition of the thresholds, namely ID_T, IDR+ and IDR−. We discovered that about 90% of the false negatives fall within the range of MI+ and MI− (i.e. mediocre mutual information). Such pairs have the opportunity to be merged if they demonstrate adequate independence from s. Unfortunately, the independence ID of either one or both members of most of these pairs failed to satisfy our independence thresholds. For example, referring to rows 72, 120 and 567 as shown in Figure 3, one will notice that they have mediocre mutual information as defined by MI+ = 0.9 and MI− = 0.02. In the case of row 72, both a_x and a_y have an independence lower than the threshold ID_T = 6. This resulted in the decision not to merge them. The same happened in row 120. In row 567, only a_y has an independence lower than ID_T and at the same time, their IDR is well above the upper limit IDR+. The remaining 10% of the false negatives are simply due to the extremely low mutual information MI(a_x, a_y) of the pairs. Take for example pair 171 in Figure 3, where the mutual information is only 0.002, which is well below MI−. Secondly, due to the small number of false positives, few conclusions can be drawn. From our analysis of the results, most of the false positives are due to their high mutual information (i.e. MI(a_x, a_y) above MI+). Pair 66 in Figure 3 is an example which is incorrectly merged due to a mutual information value higher than MI+.

Figure 3: This figure shows a snapshot of some samples of false positives and false negatives taken from our evaluations. Row 66 is an example of a merged pair which is not supposed to be combined (i.e. a false positive) while the remaining rows are false negatives. Each row has two lexical units a_x (column 2) and a_y (column 4) to be examined for their mutual information (column 8) and independence from s (columns 5 and 6) to determine if they should be merged into s (column 10). The decision to merge (column 9) is accomplished based on Equation 7.

From our discussion above, one would realise that both the recall and the precision can be improved by adjusting the various thresholds. Most of the time, the improvement of one comes at the expense of the other. For example, we can improve the recall by lowering ID_T and broadening the range between IDR+ and IDR− at the expense of precision. In other words, more pairs will have ID values exceeding the threshold ID_T and more pairs will fall within the range of acceptable IDR. In this case, we will lower the number of false negatives and hence achieve higher recall. Similarly, we can improve the precision by increasing MI+. In this case, an increasing number of pairs will have mediocre mutual information (i.e. within the range MI+ and MI−). Consequently, the number of false positives will reduce when such pairs with mediocre mutual information are not merged due to their inability to satisfy the additional constraints in the form of ID_T, IDR+ and IDR−.

Due to the lack of existing dedicated techniques for measuring unithood, we were unable to perform a comparative study. Nonetheless, the high accuracy and F-score obtained in our evaluation, and our analysis of the false positives and the false negatives, revealed the potential of our new measures in terms of high precision and recall, portability across domains, and configurability of the performance.



5 Applicability to Named-Entity 
Recognition 

Named-entity recognition is one of the important tasks in information extraction. It involves the identification of noun phrases or more specifically, proper names from free text, and their classification into one of many categories such as persons, geographical locations and companies. A typical named-entity recogniser performs part-of-speech tagging, and relies on patterns, heuristics and dictionaries to identify proper names. The use of machine learning and other probabilistic methods such as support vector machines (Mayfield et al. (2003)) and hidden Markov models (Mittal et al. (1999)) has also gained popularity.

Most of these existing techniques for named-entity recognition work well when they are dealing with single-word names or sequences of nouns. In the face of more complex named entities that contain other parts of speech, especially prepositions, these techniques perform poorly. For example, most existing techniques would have to rely on heuristics and dictionaries to differentiate between "Barrow in Furness" and "countries in Asia". Unfortunately, there are many problems related to the use of heuristics and dictionaries. For one, the maintenance of such dictionaries is costly and difficult. Questions about the reliability of stop-word boundaries, word capitalisation and punctuation arise regarding the use of heuristics and linguistic cues. For example, how can named-entity recognisers rely on capitalised words in the case of "Barrow in Furness" versus "Perth in Australia"? Besides prepositions, the conjunction "and" in named entities poses a similar challenge. According to Osenova and Kolkovska (2002), "...problem arises when named-entity is a phrase, comprising a conjunction...". In such cases, the named-entity recogniser only recognises part of the named entity. For example, in the case of "Centre for Disease Control and Prevention", only "Centre for Disease Control" will be extracted. Other researchers (Mani et al. (1996)) have taken the default step of simply grouping name segments separated by prepositions or conjunctions into longer names.

Our new approach of deciding on whether two word sequences are to be merged or not is highly applicable to many areas in natural language processing, especially named-entity recognition. Because our approach does not depend on any predefined resources, it avoids the problems highlighted in the previous paragraph. Using our UH(a_x, a_y) function, a named-entity recogniser can determine whether or not parts of proper names should be merged together without relying on unreliable heuristics, or on domain-restricted patterns and dictionaries.

6 Conclusion and Future Work 

Many researchers inappropriately assume that ter- 
mhood subsumes unithood. In this paper, we high- 
lighted the significance of unithood and that its 
measurement should be given equal attention by 
researchers in automatic term recognition. The po- 
tential of unithood measurements can be extended 
to other areas in natural language processing such 
as noun-phrase chunking and named-entity recog- 
nition. 

We proposed a new approach that provides dedicated measures specialised in measuring unithood. The first measure employs mutual information MI(a_x, a_y) to capture the interdependence of the existence of a_x and a_y, and allows us to determine whether a_x and a_y are better off separated or merged in order to produce a stronger unit. The second measure, defined as Independence (ID), was inspired by Cvalue and is meant to provide additional evidence in the determination of unithood. These two measures are combined into a Boolean function defined as Unithood (UH) that decides whether a_x and a_y should be combined to form s.

Our evaluations revealed a precision and recall of 98.68% and 91.82% respectively, with an accuracy of 95.42%, in measuring the unithood of 1005 test cases. Due to the lack of existing dedicated techniques for measuring unithood, we were unable to perform a comparative study. Nonetheless, the excellent evaluation results together with the real-world text employed in our evaluation demonstrate the strengths of our new approach with regard to the determination of unithood. As future work, we plan to increase the size of our test set to further establish the advantages of our new approach demonstrated through the current evaluations. We also plan to reduce the number of thresholds and, at the same time, find ways to optimise them automatically.